
# PyTorch-Transformers

PyTorch-Transformers is a library developed by the HuggingFace team that contains PyTorch implementations of popular NLP Transformers.


## Installation ##

The packages required to run the library can be installed with the following command:

In [87]:
%%bash
pip install tqdm boto3 requests regex sentencepiece sacremoses transformers



## Example: preprocessing

Strings must be converted into a sequence of tokens before they can be fed to a Transformer. This is the job of a `Tokenizer`. Each model has its own tokenizer, and some tokenizing methods differ between tokenizers. The complete documentation can be found [here](https://huggingface.co/pytorch-transformers/main_classes/tokenizer.html).

In [88]:
import torch
# download the tokenizer for a specific model
# 'bert-base-cased' is a base BERT Transformer pre-trained on cased English text.
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')

text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Tokenize input with `encode`. Note that tokenization sometimes splits words (e.g. surnames)
print(tokenizer.encode(text_1, add_special_tokens=False))

Using cache found in /Users/andrea/.cache/torch/hub/huggingface_pytorch-transformers_master


[2627, 1108, 3104, 1124, 15703, 136]


In [89]:
# Detokenize single tokens with `decode`
print(tokenizer.decode(1124))
print(tokenizer.decode(15703))
print(tokenizer.decode(136))

He
##nson
?
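
The `##` prefix in the output above marks a piece that continues the previous token. BERT-style WordPiece tokenizers split each word greedily, longest match first. The following is a minimal sketch of that idea in plain Python; `wordpiece` and `toy_vocab` are hypothetical names introduced here for illustration, and the tiny vocabulary is an assumption, not the real `bert-base-cased` vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is treated as unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this sketch).
toy_vocab = {"Jim", "He", "##nson", "was", "puppet", "##eer"}

print(wordpiece("Henson", toy_vocab))
print(wordpiece("puppeteer", toy_vocab))
```

Because the split depends entirely on the vocabulary, the same word can be tokenized differently by different models.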


In [90]:
# Detokenize a sequence of tokens 
print(tokenizer.decode(tokenizer.encode(text_1, add_special_tokens=False)))

Who was Jim Henson?


In [91]:
# `encode` also accepts a pair of sentences.
# Special tokens can be added around the sequences
# (for BERT: [CLS] at the beginning and [SEP] at the end of each sequence)
print(tokenizer.encode(text_1, text_2, add_special_tokens=True))

[101, 2627, 1108, 3104, 1124, 15703, 136, 102, 3104, 1124, 15703, 1108, 170, 16797, 8284, 102]
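
The layout of the encoded pair can be reproduced by hand. This sketch assumes, based on the output above, that 101 is the id of `[CLS]` and 102 the id of `[SEP]`; `build_pair` is a hypothetical helper written for illustration and is not part of the library.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble a BERT-style sentence pair: [CLS] A [SEP] B [SEP]."""
    return [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]


# Token ids of the two sentences without special tokens, copied from the outputs above.
ids_a = [2627, 1108, 3104, 1124, 15703, 136]          # "Who was Jim Henson ?"
ids_b = [3104, 1124, 15703, 1108, 170, 16797, 8284]   # "Jim Henson was a puppeteer"

pair = build_pair(ids_a, ids_b)
print(pair)
```

The result matches the list printed by `encode(text_1, text_2, add_special_tokens=True)`.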


## Example: Predict a missing (masked) word in a sequence 

In [92]:
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
print(tokenizer.decode(indexed_tokens[masked_index]))

Jim


In [93]:
# replace the token to be masked with a special mask_token_id
indexed_tokens[masked_index] = tokenizer.mask_token_id
tokens_tensor = torch.tensor([indexed_tokens])
print(tokenizer.decode(indexed_tokens[masked_index]))

[MASK]


In [94]:
# load a pre-trained `BertForMaskedLM`, a BERT model with a language modeling head on top.
masked_lm_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForMaskedLM', 'bert-base-cased')

with torch.no_grad():
    predictions = masked_lm_model(tokens_tensor)

# the model predicts a distribution over the token vocabulary for each input token
print(tokens_tensor)
print(torch.argmax(predictions[0][0], dim=1))

Using cache found in /Users/andrea/.cache/torch/hub/huggingface_pytorch-transformers_master
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[ 101, 2627, 1108, 3104, 1124, 15703, 136, 102, 103, 1124,
 15703, 1108, 170, 16797, 8284, 102]])
tensor([ 119, 2627, 1108, 3104, 1124, 15703, 136, 119, 3104, 1124,
 15703, 1108, 170, 16797, 119, 119])


In [95]:
# note: at the positions of the special tokens ([CLS], [SEP]) the model predicts '.' (id 119)
print(tokenizer.decode(101), tokenizer.decode(102), tokenizer.decode(119))

[CLS] [SEP] .


In [96]:
# the prediction for the masked token can be recovered by looking at `masked_index`
predicted_token = torch.argmax(predictions[0][0], dim=1)[masked_index]
tokenizer.decode(predicted_token)

'Jim'
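
Taking the argmax returns only the single best candidate. To inspect several candidates, the logits at the masked position can be turned into probabilities with a softmax and ranked. This is a minimal sketch in plain Python so that it runs without downloading a model; `top_k`, the five-word vocabulary, and the logit values are all made up for illustration.

```python
import math


def top_k(logits, id_to_token, k=3):
    """Return the k most probable tokens with their softmax probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


# Hypothetical vocabulary and logits for the masked position (assumptions).
toy_tokens = ["Jim", "He", "John", "puppet", "."]
toy_logits = [4.0, 1.5, 2.5, -1.0, 0.0]

for token, prob in top_k(toy_logits, toy_tokens):
    print(f"{token}: {prob:.3f}")
```

With the real model output, the same ranking can be obtained by applying `torch.softmax` to `predictions[0][0, masked_index]` and passing the result to `torch.topk`.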

## Example: question answering

In [97]:
# load a model and tokenizer that are appropriate for question answering
question_answering_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-large-uncased-whole-word-masking-finetuned-squad')
question_answering_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-large-uncased-whole-word-masking-finetuned-squad')

# The format is paragraph first and then question
text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = question_answering_tokenizer.encode(text_1, text_2, add_special_tokens=True)

# Note that this tokenizer does not split surnames (but puppeteer still gets split)
print(indexed_tokens)
print(question_answering_tokenizer.decode(3958))
print(question_answering_tokenizer.decode(27227))
print(question_answering_tokenizer.decode(13997))
print(question_answering_tokenizer.decode(11510))


Using cache found in /Users/andrea/.cache/torch/hub/huggingface_pytorch-transformers_master
Using cache found in /Users/andrea/.cache/torch/hub/huggingface_pytorch-transformers_master


[101, 3958, 27227, 2001, 1037, 13997, 11510, 102, 2040, 2001, 3958, 27227, 1029, 102]
jim
henson
puppet
##eer


In [98]:
# The model needs to be told where the paragraph ends and the question begins
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the start and end position logits
with torch.no_grad():
    out = question_answering_model(tokens_tensor, token_type_ids=segments_tensors)
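
The segment ids can also be derived from the token ids instead of being written by hand. This is a minimal sketch, assuming that id 102 is `[SEP]` (as in the printed token list) and following BERT's usual convention, in which `[CLS]`, the first sequence, and the first `[SEP]` belong to segment 0 and everything after them to segment 1. `segment_ids_from_tokens` is a hypothetical helper, not a library function.

```python
def segment_ids_from_tokens(token_ids, sep_id=102):
    """Segment 0 up to and including the first [SEP], segment 1 for the rest."""
    first_sep = token_ids.index(sep_id)  # position of the first [SEP]
    n_first = first_sep + 1              # [CLS] + first sequence + first [SEP]
    return [0] * n_first + [1] * (len(token_ids) - n_first)


# Token ids of the paragraph/question pair, copied from the output above.
qa_token_ids = [101, 3958, 27227, 2001, 1037, 13997, 11510, 102,
                2040, 2001, 3958, 27227, 1029, 102]

print(segment_ids_from_tokens(qa_token_ids))
```

Under this convention the boundary falls one position later than in the hand-written list, so feeding these ids to the model may give slightly different numbers from the outputs recorded in this notebook.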


In [99]:
# The model predicts (the logits of) a distribution over start and end positions 
# of the answer in the paragraph

print(torch.softmax(out.start_logits, dim=1))
print(torch.softmax(out.end_logits, dim=1))

tensor([[1.2956e-03, 1.3718e-01, 2.0828e-03, 4.6422e-04, 1.2775e-01, 7.2051e-01,
 3.5628e-03, 1.2956e-03, 7.7327e-04, 5.0221e-04, 3.0025e-03, 2.2000e-04,
 6.7787e-05, 1.2957e-03]])
tensor([[1.8842e-02, 2.2303e-03, 9.1058e-03, 3.5248e-04, 1.4338e-03, 6.1204e-03,
 9.2058e-01, 1.8841e-02, 2.4989e-04, 2.3315e-04, 2.9274e-04, 2.5887e-03,
 2.9281e-04, 1.8833e-02]])


In [100]:
# take the most likely start and end positions and slice out the answer tokens
answer_tokens = indexed_tokens[torch.argmax(out.start_logits):torch.argmax(out.end_logits)+1]
print(answer_tokens)
answer = question_answering_tokenizer.decode(answer_tokens)
print(answer)

[13997, 11510]
puppeteer
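
Taking the two argmaxes independently works here, but in general it can produce an end position that lies before the start position. A common alternative is to score every valid span jointly. The sketch below does this in plain Python using the probabilities printed above; `best_span` and its `max_len` limit are hypothetical, chosen for illustration rather than taken from the library.

```python
def best_span(start_probs, end_probs, max_len=15):
    """Return the (start, end) pair with start <= end that maximizes p_start * p_end."""
    best, best_score = None, -1.0
    for s, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_probs))):
            score = p_start * end_probs[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Probabilities copied from the softmax outputs above.
start_probs = [1.2956e-03, 1.3718e-01, 2.0828e-03, 4.6422e-04, 1.2775e-01, 7.2051e-01,
               3.5628e-03, 1.2956e-03, 7.7327e-04, 5.0221e-04, 3.0025e-03, 2.2000e-04,
               6.7787e-05, 1.2957e-03]
end_probs = [1.8842e-02, 2.2303e-03, 9.1058e-03, 3.5248e-04, 1.4338e-03, 6.1204e-03,
             9.2058e-01, 1.8841e-02, 2.4989e-04, 2.3315e-04, 2.9274e-04, 2.5887e-03,
             2.9281e-04, 1.8833e-02]
token_ids = [101, 3958, 27227, 2001, 1037, 13997, 11510, 102,
             2040, 2001, 3958, 27227, 1029, 102]

(span_start, span_end), span_score = best_span(start_probs, end_probs)
print(token_ids[span_start:span_end + 1])
```

For this example the joint search selects the same two tokens as the independent argmaxes.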


## Example: predict if a sentence is a paraphrase of another one

In [101]:
# load a model and tokenizer appropriate for sequence classification
sequence_classification_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-cased-finetuned-mrpc')
sequence_classification_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased-finetuned-mrpc')

text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = sequence_classification_tokenizer.encode(text_1, text_2, add_special_tokens=True)

# again, the model needs to know when the second sentence starts
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the sequence classification logits
with torch.no_grad():
    seq_classif_logits = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors)

Using cache found in /Users/andrea/.cache/torch/hub/huggingface_pytorch-transformers_master
Using cache found in /Users/andrea/.cache/torch/hub/huggingface_pytorch-transformers_master


In [102]:
print(seq_classif_logits[0])
# class 0 means the two sentences are not paraphrases of each other

tensor([[ 0.9574, -0.2855]])


In [103]:
# a positive example
text_1 = "The new movie is great"
text_2 = "I love the new movie"
indexed_tokens = sequence_classification_tokenizer.encode(text_1, text_2, add_special_tokens=True)

segments_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

with torch.no_grad():
    seq_classif_logits = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors)

print(seq_classif_logits[0])

tensor([[-0.1473, 1.5291]])
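
The raw logits are easier to read once converted to class probabilities. This is a minimal sketch in plain Python applied to the two logit pairs printed above; the `LABELS` names are an assumption that follows the notebook's comment that class 0 means "not a paraphrase".

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["not paraphrase", "paraphrase"]  # class 0 / class 1 (assumed label names)

# Logits copied from the two outputs above.
examples = {
    "Jim Henson pair": [0.9574, -0.2855],
    "movie pair": [-0.1473, 1.5291],
}

for name, logits in examples.items():
    probs = softmax(logits)
    label = LABELS[probs.index(max(probs))]
    print(name, label, [round(p, 3) for p in probs])
```

With tensors, the equivalent call is `torch.softmax(seq_classif_logits[0], dim=1)`.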
